[SPARK-3713][SQL] Uses JSON to serialize DataType objects #2563
liancheng wants to merge 8 commits into apache:master from
Conversation
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
Tests timed out after a configured wait of
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
I think this comment is in the wrong place. We should probably note that this parser is deprecated and is only here for backwards compatibility. We might even print a warning when it is used so we can get rid of it eventually.
Ah, this comment was a mistake. Instead of printing a warning, I made fromCaseClassString() private. It's only referenced by CaseClassStringParser, which has already been marked as deprecated.
Minor comment, otherwise this LGTM.
python/pyspark/sql.py (outdated)
You can have a default implementation as:
self.__class__.__name__[:-4].lower()
Thanks for this, it saved lots of boilerplate code! Removed all simpleString() methods in the subclasses.
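The suggestion above can be sketched as a minimal, self-contained example (the subclass names here are illustrative stand-ins for PySpark's type classes, not the actual implementation): a single default `simpleString` on the base class derives the name by stripping the trailing "Type" suffix, so subclasses need no override.

```python
# Minimal sketch of the suggested default implementation.
# Subclass names are illustrative, not PySpark's real class hierarchy.
class DataType(object):
    def simpleString(self):
        # IntegerType -> "integer", StringType -> "string", etc.
        return self.__class__.__name__[:-4].lower()

class IntegerType(DataType):
    pass

class StringType(DataType):
    pass

print(IntegerType().simpleString())  # -> integer
print(StringType().simpleString())   # -> string
```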
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
python/pyspark/sql.py (outdated)
Why not just put _get_simple_string here? (It's not needed as a separate function; it's harder to understand without this context.)
To make it available on the class, it could be a classmethod:

```python
@classmethod
def simpleString(cls):
    return cls.__name__[:-4].lower()
```
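A quick sketch of why the classmethod form is attractive (subclass name illustrative): the simple string becomes available on the class itself, with no instance required, while instance calls still work.

```python
# Sketch of the classmethod variant; BooleanType is an illustrative
# stand-in, not the actual PySpark class.
class DataType(object):
    @classmethod
    def simpleString(cls):
        # Strip the trailing "Type" suffix: BooleanType -> "boolean".
        return cls.__name__[:-4].lower()

class BooleanType(DataType):
    pass

# Works on the class itself as well as on instances.
print(BooleanType.simpleString())    # -> boolean
print(BooleanType().simpleString())  # -> boolean
```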
@davies Thanks for all the suggestions, they really make things a lot cleaner!
QA tests have started for PR 2563 at commit
Force-pushed from 54c46ce to 785b683.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
Tests timed out after a configured wait of
python/pyspark/sql.py (outdated)
If you'd like to use a single string for primitive types, that's still doable; just use a one-layer dict for the others.
Either one is good to me.
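The hybrid encoding discussed above can be sketched as follows. This is a simplified, self-contained illustration (class names and the jsonValue method are stand-ins for the PR's actual code): primitive types serialize to a bare JSON string, while complex types use one layer of dict.

```python
import json

# Sketch of the encoding under discussion: primitives as bare strings,
# complex types as a flat dict. Names are simplified stand-ins.
class DataType(object):
    def jsonValue(self):
        # Primitive default: just the simple string, e.g. "integer".
        return self.__class__.__name__[:-4].lower()

class IntegerType(DataType):
    pass

class StringType(DataType):
    pass

class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    def jsonValue(self):
        # Complex type: one layer of dict wrapping the element type.
        return {"type": "array",
                "elementType": self.elementType.jsonValue(),
                "containsNull": self.containsNull}

print(json.dumps(IntegerType().jsonValue()))            # -> "integer"
print(json.dumps(ArrayType(StringType()).jsonValue()))
```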
This looks good to me; you just forgot to roll back the changes in run-tests after debugging.
@davies Sorry for my carelessness... And thanks again for all the great advice!
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
LGTM now, thanks!
Could you rebase this to master?
Force-pushed from de18dea to fc92eb3.
Finished rebasing.
QA tests have started for PR 2563 at commit
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have finished for PR 2563 at commit
@marmbrus I think this is ready to go.
Thanks! I've merged this.
This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.

Since we already write schema information to Parquet metadata in the old style, we have to preserve the old `DataType` parser and ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.

@JoshRosen @davies Please help review the PySpark-related changes, thanks!
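The core benefit of the switch can be sketched in a few lines: with JSON, a stock parser recovers the schema structure losslessly, whereas the old case-class `toString` output required a hand-written parser. The schema layout below is illustrative only, not Spark's exact JSON format.

```python
import json

# Illustrative round trip of a JSON-encoded schema. With JSON, parsing
# is handled by a standard library parser; the old toString-based format
# needed a hand-written, fragile parser (CaseClassStringParser).
def to_json(schema):
    """Serialize a schema (nested dicts/strings) to a JSON string."""
    return json.dumps(schema)

def from_json(s):
    """Parse the JSON string back into the same nested structure."""
    return json.loads(s)

schema = {"type": "struct",
          "fields": [{"name": "id", "type": "integer", "nullable": False},
                     {"name": "name", "type": "string", "nullable": True}]}

assert from_json(to_json(schema)) == schema  # lossless round trip
print(to_json(schema))
```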